Random forest versus logistic regression: a large-scale benchmark experiment
Abstract
The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has since become a standard classification approach, competing with logistic regression (LR) in many innovation-friendly scientific fields. In this context, we present a large-scale benchmarking experiment based on 260 real datasets, comparing the prediction performance of the original version of RF with default parameters and of LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias. RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.032 (95%-CI=[0.025, 0.042]) for the accuracy, 0.043 (95%-CI=[0.032, 0.056]) for the Area Under the Curve, and −0.028 (95%-CI=[−0.036, −0.022]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were highly dependent on the inclusion criteria used to select the example datasets, which emphasizes the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, carefully designed and based on a large number of datasets, will be necessary in the future to evaluate further variants, implementations or parameters of random forests that may yield improved accuracy compared to the original version with default values.
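The three performance measures named in the abstract, and a confidence interval for the mean per-dataset difference, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy labels and the predicted probabilities are hypothetical, and the percentile bootstrap is only one plausible way to obtain such an interval (the abstract does not say which method was used).

```python
import random

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of observations whose thresholded prediction matches the 0/1 label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_hat))
    return hits / len(y_true)

def brier_score(y_true, p_hat):
    """Mean squared gap between predicted probability and 0/1 label (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def auc(y_true, p_hat):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-dataset differences.

    Assumption: datasets are resampled with replacement; this is an
    illustrative choice, not a method confirmed by the abstract.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy data for one dataset: true labels and predicted
# probabilities from two classifiers (invented numbers, not benchmark results).
y = [1, 0, 1, 1, 0, 0]
p_rf = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
p_lr = [0.8, 0.3, 0.4, 0.6, 0.6, 0.2]

# Per-dataset difference RF minus LR for each measure.
diff_acc = accuracy(y, p_rf) - accuracy(y, p_lr)
diff_auc = auc(y, p_rf) - auc(y, p_lr)
diff_brier = brier_score(y, p_rf) - brier_score(y, p_lr)
```

In a benchmark of this kind, one such difference would be computed per dataset, and `bootstrap_ci` would then be applied to the resulting list. A positive difference favours RF for accuracy and AUC, whereas a negative difference favours RF for the Brier score, which is why the reported Brier interval lies below zero.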
Similar resources
Comparison of Random Forest and Logistic Regression Methods in Predicting Mortality in Colorectal Cancer Patients and its Related Factors
Background and Objectives: The purpose of this study was to predict mortality from colorectal cancer in Iranian patients and to determine the factors affecting mortality in these patients, using random forest and logistic regression methods. Methods: Data from 304 patients with colorectal cancer registry from the Gastroenterology and Liver Research Center of Shah...
Extreme Logistic Regression: A Large Scale Learning Algorithm with Application to Prostate Cancer Mortality Prediction
With the recent popularity of electronic medical records, an enormous amount of medical data is being generated every day at an exponential rate. Machine learning methods have been shown in many studies to be capable of producing automatic medical diagnostic models such as automated prognostic models. However, many powerful machine learning algorithms such as support vector machine (SVM), Random F...
Joint Maximum Purity Forest with Application to Image Super-Resolution
In this paper, we propose a novel random-forest scheme, namely Joint Maximum Purity Forest (JMPF), for classification, clustering, and regression tasks. In the JMPF scheme, the original feature space is transformed into a compactly pre-clustered feature space, via a trained rotation matrix. The rotation matrix is obtained through an iterative quantization process, where the input data belonging...
Susceptibility Zoning of Dust Source Areas by Data Mining Methods over Khorasan Razavi Province
Extended abstract. Introduction: Dust storms are natural hazards that affect weather conditions, human health and ecosystems. Atmospheric processes are directly affected by the absorption and diffusion of radiation by dust, and dust in clouds acts as a condensation nucleus. The main dust areas in the world are topographically flat, dry areas with erosion-sensitive soil and poor vege...
A Robust Missing Value Imputation Method MifImpute For Incomplete Molecular Descriptor Data And Comparative Analysis With Other Missing Value Imputation Methods
Missing data imputation is an important research topic in data mining. Large-scale molecular descriptor data may contain missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method MiFoImpute based on a random forest. By averaging over many unpruned regres...
Publication date: 2017